In this course, Introduction to Open Data Science, we will hopefully learn to utilize some methods of open data science. We will use the R programming language to write code for analyzing different types of problems, as well as GitHub to store and publish our final products. If you wish to see my GitHub repository, please click here!
In this exercise, I use a data set I created earlier, based on data collected during the Johdatus yhteiskuntatilastotieteeseen (autumn 2014) course. The original study examined the study success, motivation and learning habits of social science students.
Let us start by reading the data file and studying its properties:
lrn14 <- read.table("data/learning2014.txt", sep="\t", header=TRUE)
We can study the dimension of the data by writing:
dim(lrn14)
## [1] 166 7
It seems that the data set contains 7 columns and 166 rows.
And to study the structure:
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The first two columns give some basic information about the students, i.e. their gender and age, respectively. The third is the average of ten Likert-scale questions describing the students' attitude toward statistics, and the following three columns measure deep, strategic and surface learning, respectively. Finally, the last column contains the exam points.
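To illustrate how such combination variables arise, here is a minimal sketch (the item columns q1–q3 are made up for illustration; the real data averages ten Likert items per scale):

```r
# hypothetical Likert items; a combination variable is simply their row mean
items <- data.frame(q1 = c(4, 2), q2 = c(3, 3), q3 = c(5, 1))
attitude_score <- rowMeans(items)
attitude_score  # 4 and 2
```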
To study the data set further, we can use the summary() command to calculate some basic statistical properties, such as the mean and median:
summary(lrn14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
On the other hand, we can also draw pair plots to describe the relations between the variables. Note that here blue represents males and red females.
# pair plots (ggpairs comes from the GGally package)
library(ggplot2)
library(GGally)
ggpairs(lrn14, mapping = aes(alpha = 0.3, col = gender), upper = list(continuous = wrap("cor", size = 3)), lower = list(combo = wrap("facethist", bins = 20)))
Somewhat surprisingly, most of the distributions are, to some extent, bell-shaped, or at least mainly unimodal. Furthermore, there is a fairly strong positive correlation between the exam score and the student's attitude toward statistics, which is not unexpected. However, there is also a mild negative correlation between surface and deep learning, which is, at first glance, odd.
Let us then study how the exam score depends on the other variables. Based on the previous figure, the most promising candidates for explanatory variables are attitude (cor = 0.437) as well as strategic (0.146) and surface learning (-0.144). Using multiple linear regression, we find that:
# linear regression model
regression_model3 <- lm(points ~ attitude + stra + surf, data = lrn14)
# summary of our model
summary(regression_model3)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Here, the lm function uses a two-sided Student's t-test to test the null hypothesis that a coefficient is zero. The test assumes that the errors are normally distributed, and it reports a p-value: the probability of observing a t statistic at least as extreme as the one obtained, assuming the null hypothesis is true. Together with a chosen significance level, one can then state whether there is a significant deviation from the null hypothesis.
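As a sanity check of where these p-values come from, we can reproduce one of them from the t value and the residual degrees of freedom reported in the summary above:

```r
# two-sided p-value: probability of a t statistic at least this extreme
# under the null hypothesis that the true coefficient is zero
t_value <- 5.913  # t value of 'attitude' in the summary above
df      <- 162    # residual degrees of freedom in the summary above
p_value <- 2 * pt(-abs(t_value), df = df)
p_value  # about 1.9e-08, matching Pr(>|t|) above
```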
Therefore, based on the summary, the only explanatory variable which is statistically significant, i.e. whose p-value is smaller than the significance level \(\alpha = 0.05\), is attitude. We may therefore discard the other two variables and repeat the analysis:
regression_model2 <- lm(points ~ attitude, data = lrn14)
summary(regression_model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The obtained fit seems slightly better, especially in the case of the y-intercept, whose standard error is now much smaller.
Using the summary, one can see that the slope coefficient is positive. This indicates a positive correlation between a person's exam score and their attitude towards statistics, i.e. motivated students get better grades. On the other hand, this result also reveals that the exam score depends only weakly on the strategic and surface learning parameters.
The multiple R-squared tells us how well the fitted linear function describes the given data, and its value is always between zero and one. In general, the higher the R-squared, the better the model fits.
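As a quick illustration of this definition (on simulated data, not the course data), R-squared is one minus the ratio of the residual sum of squares to the total sum of squares:

```r
set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
# R-squared by hand: 1 - SSE / SST
r2_manual <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```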
The multiple R-squared values of the presented models are quite small (0.21 and 0.19, respectively) but not close to zero. As stated above, a small value typically indicates that the model does not explain much of the behavior of the target variable. However, this result is not unexpected, because the population consists of humans, and humans do not behave as predictably as elementary particles, for example. Furthermore, in our latter model the coefficients are highly statistically significant, and hence we can state that the fit is faithful according to this diagnostic.
The assumptions of the linear regression model are as follows:

* The relationship is truly linear
* The errors are normally distributed with constant variance
* The errors are uncorrelated
* The errors do not depend on the model variables
#Plotting residuals vs. fitted values (1), normal QQ-plot (2) and residuals vs leverage (5)
plot(regression_model2, which = c(1,2,5))
In the residuals vs. fitted values plot, everything seems mostly fine, and no clear deviation from the zero line appears. Nonetheless, three data points stand out (#35, #56 and #145). Even so, this diagnostic plot indicates that the relationship is plausibly linear.
Correspondingly, the normal QQ-plot continues the story, but now there is some deviation from the baseline at both ends. Nonetheless, the underlying distribution can still be considered reasonably normal. Finally, the residuals vs. leverage plot can be used to study the last assumption, and because we do not see any extreme outliers, we can state that the assumption holds.
Based on the diagnostic plots, one may state that the assumptions of the given model are valid.
This week we are studying logistic regression.
During this exercise, we study a data set which contains information about students' grades and the factors affecting them. For more information about the set, see the original study.
Let us start by reading the data file ‘alc.txt’ and studying its properties:
alc <- read.table("data/alc.txt", sep="\t", header=TRUE)
Let us then see what kind of variables the file contains:
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
You can see 35 different variables, such as 'age' and 'sex'. In this exercise, however, we are interested in studying alcohol consumption, and therefore 'alc_use' and 'high_use' are the relevant variables. Here, 'alc_use' is the average of weekday ('Dalc') and weekend ('Walc') consumption on a Likert scale, and the 'high_use' variable tells us whether the alcohol consumption is high, i.e. 'alc_use' > 2.
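The construction of these two variables can be sketched on a toy data frame (the values below are made up; only the column names follow the data set):

```r
# made-up weekday and weekend consumption on a 1-5 Likert scale
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average consumption
toy$high_use <- toy$alc_use > 2            # TRUE when the average exceeds 2
toy
```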
Note that the used data set is a combination of two data sets provided by the original study. Hence, all numerical variables are average values.
I have four hypotheses related to high alcohol consumption:
1. People whose final grades ('G3') are low are likelier to use a lot of alcohol.
2. The alcohol consumption of young (‘age’) people is high.
3. Good relationships (‘famrel’) with your family members prevent drinking.
4. People who use lots of alcohol fail courses (‘failures’) more often.
To study these hypotheses, I will first examine each of them graphically.
In this section, we study the first hypothesis: "people whose final grades ('G3') are low are likelier to use a lot of alcohol." Let us first make a box plot
g1 <- ggplot(alc, aes(x = high_use, y = G3))
g1 + geom_boxplot() + ylab("Final grade") + xlab("High usage of alcohol")
and a bar one:
counts <- table(alc$high_use, alc$G3)
barplot(counts, col=c("darkblue","red"), beside=T, legend = rownames(counts), xlab = "Final grade")
Based on these plots, it seems that the mean final grade is indeed higher among students who do not use a lot of alcohol. Nevertheless, the variance in that group is also notably higher, which is a bit surprising. All in all, these findings offer at most weak support for the hypothesis.
Let us then study hypothesis number 2, "the alcohol consumption of young ('age') people is high", by making a box plot and a bar plot:
g1 <- ggplot(alc, aes(x = high_use, y = age))
g1 + geom_boxplot() + ylab("Age") + xlab("High usage of alcohol")
counts <- table(alc$high_use, alc$age)
barplot(counts, col=c("darkblue","red"), beside=T, legend = rownames(counts), xlab = "Age")
What is notable is that the students are quite young; the youngest are only 15 years old. (NB The legal drinking age in Portugal is 16.) It is therefore not surprising that the youngest students are not heavy drinkers; indeed, the mean age of the heavy drinkers is about two years higher. Hence, our hypothesis appears incorrect, and it is quite likely that it would have fared better if the youngest students had been somewhat older than 15.
The third hypothesis states “Good relationships (‘famrel’) with your family members prevent drinking.”
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
g1 + geom_boxplot() + ylab("Relationship between a student and their family") + xlab("High usage of alcohol")
counts <- table(alc$high_use, alc$famrel)
barplot(counts, col=c("darkblue","red"), beside=T, legend = rownames(counts), xlab = "Relationship between a student and their family")
Based on the plots, the statement seems somewhat true. Even though the median relationship (4 on the Likert scale) is the same in both cases, the distribution for low alcohol consumption is tilted toward the highest value, whereas for high consumption it is spread more evenly around the median.
The last hypothesis is: "People who use lots of alcohol fail courses ('failures') more often."
xtabs(~ failures + high_use, data = alc)
## high_use
## failures FALSE TRUE
## 0 244 90
## 1 12 12
## 2 10 9
## 3 2 3
Based on the cross-tabulation above, the absolute counts are dominated by students with zero failures in both groups. Proportionally, however, the share of students with at least one failure is larger among heavy users (about 21 % versus 9 %), so there may be a mild association between alcohol usage and course failures.
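Because the two groups are of very different sizes, conditional proportions are more informative than raw counts; a sketch using the counts from the table above:

```r
# counts from the cross-tabulation above (rows: failures 0-3, columns: high_use)
counts <- matrix(c(244, 90,
                   12, 12,
                   10, 9,
                   2, 3),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(failures = 0:3, high_use = c("FALSE", "TRUE")))
# share of students with at least one course failure in each group
prop_fail <- colSums(counts[-1, ]) / colSums(counts)
round(prop_fail, 2)  # about 0.09 (low use) vs 0.21 (high use)
```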
Let us fit a logistic regression model which uses the chosen variables from the previous sections to explain the high/low usage of alcohol, and then examine the model summary:
logistic_model <- glm(high_use ~ G3 + age + famrel + failures, data = alc, family = "binomial")
summary(logistic_model)
##
## Call:
## glm(formula = high_use ~ G3 + age + famrel + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.542 -0.829 -0.706 1.347 1.873
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.14681 1.80022 -1.193 0.2331
## G3 -0.04474 0.03728 -1.200 0.2301
## age 0.16300 0.10009 1.629 0.1034
## famrel -0.25594 0.12403 -2.064 0.0391 *
## failures 0.36424 0.19818 1.838 0.0661 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 448.27 on 377 degrees of freedom
## AIC: 458.27
##
## Number of Fisher Scoring iterations: 4
Based on the summary, we can see that the family relationship ('famrel') is the only variable whose p-value is under the 0.05 significance level. Hence, the findings of the previous sections are in line with this analysis; in other words, the only statistically significant relationship is between high/low alcohol usage and the family relationship.
Let us present the model coefficients as odds ratios and calculate their confidence intervals:
odd_ratio <- coef(logistic_model) %>% exp
conf_intervals <- confint(logistic_model) %>% exp
## Waiting for profiling to be done...
cbind(odd_ratio, conf_intervals)
## odd_ratio 2.5 % 97.5 %
## (Intercept) 0.1168569 0.003334282 3.945043
## G3 0.9562495 0.888423175 1.028613
## age 1.1770412 0.968307038 1.434907
## famrel 0.7741855 0.606291127 0.987785
## failures 1.4394173 0.975965957 2.136111
The odds ratios of the variables 'G3' and 'famrel' are below one, and those of the other two above one. However, 'famrel' is the only one whose confidence interval lies clearly below one, i.e. it supports a negative association with the high/low drinking factor. This finding also supports our hypothesis number 3.
On the other hand, the variable 'failures' is the only one whose odds ratio is clearly above one. This result indicates that the number of failed classes tends to be higher if the person drinks a lot. The p-value of the variable is close to the significance level but just above it (0.066). This also suggests that there is a connection, although not a very strong one.
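For intuition on what these odds ratios mean, here is a small made-up example: for a binary predictor, exp(coef) from a logistic regression equals the classic 2x2-table odds ratio.

```r
# made-up 2x2 table: 60 unexposed (20 positive), 60 exposed (30 positive)
tab <- matrix(c(40, 20,
                30, 30), nrow = 2, byrow = TRUE)
or_table <- (tab[2, 2] / tab[2, 1]) / (tab[1, 2] / tab[1, 1])  # (30/30)/(20/40) = 2

# the same value from a logistic regression on the expanded data
df <- data.frame(exposed = rep(c(0, 1), each = 60),
                 outcome = c(rep(0, 40), rep(1, 20), rep(0, 30), rep(1, 30)))
fit <- glm(outcome ~ exposed, data = df, family = "binomial")
or_glm <- unname(exp(coef(fit)["exposed"]))
c(or_table, or_glm)  # both equal 2
```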
Based on the analyses in the previous section, the only variable with a statistically significant relationship to the high/low consumption of alcohol is 'famrel'. Nonetheless, we also include the almost statistically significant variable 'failures' (because the model would be rather boring without it).
So let us first define the new logistic regression model
logistic_model_updated <- glm(high_use ~ famrel + failures, data = alc, family = "binomial")
summary(logistic_model_updated)
##
## Call:
## glm(formula = high_use ~ famrel + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4132 -0.7926 -0.7926 1.3890 1.7302
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.009978 0.490131 -0.020 0.98376
## famrel -0.246686 0.122493 -2.014 0.04402 *
## failures 0.511941 0.180980 2.829 0.00467 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 452.95 on 379 degrees of freedom
## AIC: 458.95
##
## Number of Fisher Scoring iterations: 4
Now, the variable 'failures' is clearly statistically significant (p ≈ 0.005), which was not the case in the previous section.
Let us then make a 2x2 cross tabulation of the predictions versus the actual values:
# predict() the probability of high_use
probabilities <- predict(logistic_model_updated, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 254 14
## TRUE 104 10
As you can see, the model works reasonably well, i.e. in 264 out of 382 cases it gives the right result. Nonetheless, there are 14 false positives and 104 false negatives. All in all, we cannot say that this model is perfect.
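The same numbers can be condensed into accuracy and error rates (counts copied from the table above):

```r
# confusion matrix from above (rows: actual high_use, columns: prediction)
conf <- matrix(c(254, 14,
                 104, 10),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("FALSE", "TRUE"),
                               predicted = c("FALSE", "TRUE")))
accuracy <- sum(diag(conf)) / sum(conf)  # 264 / 382
error    <- 1 - accuracy
round(c(accuracy = accuracy, error = error), 4)
```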
Besides, we may also make a plot to visualize these results:
# Some plotting libraries
library(dplyr)
library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
Finally, let us calculate the training error:
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3089005
This result indicates that the (training) error is about 0.31. The error is somewhat large but reasonable, because the p-value of the variable 'famrel' is close to the significance level of 0.05. Furthermore, we must also remember that the data set describes a subset of the human population, and therefore the error is expected to be somewhat high.
Hopefully, I am going to learn something about clustering and classification theory.
In this exercise, we will study the so called Boston data set. This set contains information about “housing values in suburbs of Boston” (see the meta file).
Let us first read the Boston data from the MASS package:
# the Boston data comes with the MASS package
library(MASS)
# let us load the object 'Boston'
data(Boston)
Then we can study its dimensions:
dim(Boston)
## [1] 506 14
and its structure:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
It seems that the data set contains 506 observations (rows) and 14 variables (columns). The definitions of the variables can be found in the meta file.
Let us then study the summary of the given data set:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
and make a graphical overview of the distributions:
ggpairs(Boston)
and present the correlation plot as well:
# correlation matrix with rounding (the pipe operator comes from dplyr)
library(dplyr)
library(corrplot)
cor_matrix <- cor(Boston) %>% round(digits = 2)
# correlation plot, upper triangle
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
If we study the distributions, we can see that many of them are bimodal with the modes far apart, such as 'indus', 'tax' and 'rad'. On the other hand, one can also find distributions which are heavily skewed, e.g. 'black', 'crim' and 'zn'. Correlation-wise, the situation is interesting because various strong correlations can be detected, and interestingly, only a few pairs show a small (absolute) correlation. (NB The variable 'chas' is a dummy variable and should be ignored here.)
Let us then scale the Boston data using function ‘scale’:
# scaling
boston_scaled <- scale(Boston)
# summary
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
As one can easily see, the mean of every variable is now zero, i.e. the distributions are centered. Also, as the name of the function 'scale' indicates, the centered variables are scaled by dividing them by their standard deviations. This helps us compare the shapes of the distributions.
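A one-line check of what 'scale' does, on a toy vector:

```r
# scale() centers by the mean and divides by the standard deviation
x <- c(2, 4, 6, 8)
scaled_builtin <- as.numeric(scale(x))
scaled_manual  <- (x - mean(x)) / sd(x)
all.equal(scaled_builtin, scaled_manual)  # TRUE
```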
Let us then replace the original crime rate with a categorical one:
# because the class of the end results of function 'scaled' is a matrix
# we need to change its class to data frame
boston_scaled <- as.data.frame(boston_scaled)
# let us find the quantiles of the crime rate
bins <- quantile(boston_scaled$crim)
# creating categorical variable of the crime rate with labeling using quantiles
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# removing original crime rate
boston_scaled <- dplyr::select(boston_scaled, -crim)
# and adding the new one
boston_scaled <- data.frame(boston_scaled, crime)
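The same quantile-binning idea on a toy vector shows why quantile breaks give roughly equally sized categories:

```r
# quantile breaks put about a quarter of the values in each category
x <- 1:20
bins <- quantile(x)
categories <- cut(x, breaks = bins, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
table(categories)  # 5 observations per category
```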
Next, we divide the data into a train set and a test set, such that 80 % of the data points belong to the train set.
# Number of observations
number_rows <- nrow(boston_scaled)
# picking up 80 % of the data points
row_index <- sample(number_rows, size = 0.8 * number_rows)
# train set
train <- boston_scaled[row_index,]
# test set
test <- boston_scaled[-row_index,]
Let us then fit the train data using linear discriminant analysis (LDA). LDA assumes that, within each class, the variables are normally distributed and share a common covariance matrix; scaling the data beforehand at least puts the variables on comparable scales. Here, the crime rate ('crime') is the target variable, and all the other variables are treated as predictors.
# lda
lda_fit <- lda(crime ~ ., data = train)
Let us then see the results in an LDA biplot:
# transforming crimes classes into numerical ones
crime_classes <- as.numeric(train$crime)
# lda biplot
plot(lda_fit, dimen = 2, col = crime_classes, pch = crime_classes)
Let us then save the crime rate variable 'crime' from the 'test' data set into a separate variable 'test_crime' and remove it from the set:
# saving the crime variable from the test data set
test_crime <- test$crime
# removing the variable
test <- dplyr::select(test, -crime)
Then we can test the fit obtained in the previous section with the 'test' data:
# prediction
lda_prediction <- predict(lda_fit, newdata = test)
Let us make a cross table to study the results:
table(correct = test_crime, predicted = lda_prediction$class)
## predicted
## correct low med_low med_high high
## low 11 18 2 0
## med_low 2 16 2 0
## med_high 0 10 12 1
## high 0 0 0 28
We can see that the highest crime rate class was predicted perfectly. For the other categories, however, the output of our model is not as good; in particular, many 'low' and 'med_high' observations are misclassified as 'med_low'. Nonetheless, this result is not a big surprise because the LDA biplot indicates notable overlap between the 'low', 'med_low' and 'med_high' classes.
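For the particular random split shown above, the overall accuracy can be read off the cross table (the split is random, so these counts vary from run to run):

```r
# counts from the cross table above (rows: correct class, columns: predicted)
conf_lda <- matrix(c(11, 18, 2, 0,
                     2, 16, 2, 0,
                     0, 10, 12, 1,
                     0, 0, 0, 28),
                   nrow = 4, byrow = TRUE)
accuracy_lda <- sum(diag(conf_lda)) / sum(conf_lda)
round(accuracy_lda, 3)  # about 0.657
```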
Let us first reload and rescale the Boston data set:
# reloading Boston data set
data(Boston)
# standardizing it
scaled_boston_new <- scale(Boston) %>% as.data.frame()
Then, we can calculate the (Euclidean) distances between observations and view a summary of the results:
dist_euclidean <- dist(scaled_boston_new)
summary(dist_euclidean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
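As a reminder of what 'dist' computes, here is the Euclidean distance between two points by hand versus dist(), which defaults to the Euclidean metric:

```r
# two points in the plane: distance should be sqrt(3^2 + 4^2) = 5
m <- rbind(c(0, 0), c(3, 4))
d_by_hand <- sqrt(sum((m[1, ] - m[2, ])^2))
d_dist    <- as.numeric(dist(m))
c(d_by_hand, d_dist)  # both 5
```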
Next, we want to divide the data into clusters using the k-means algorithm. Let us try it with 4 clusters:
# k-means
k_means1 <-kmeans(Boston, centers = 4)
# pair plot
pairs(Boston, col = k_means1$cluster)
Now, we have to figure out how many clusters we really need:
# maximum number of clusters
cluster_max <- 10
# total within sum of squares
twcss <- sapply(1:cluster_max, function(k){kmeans(Boston, k)$tot.withinss})
# let us see the result
qplot(x = 1:cluster_max, y = twcss, geom = 'line')
Based on the figure, we can see a dramatic drop in the total within-cluster sum of squares between one and two clusters. Therefore, we state that the optimal number of clusters is two.
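The logic behind this elbow heuristic is that the total within-cluster sum of squares (TWCSS) can only decrease as clusters are added; the "elbow" is where further decreases become marginal. A toy demonstration on simulated data, using several random starts per k:

```r
set.seed(7)
toy <- matrix(rnorm(200), ncol = 2)
twcss_toy <- sapply(1:5, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
diff(twcss_toy)  # TWCSS shrinks as k grows (up to local optima)
```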
Finally, let us plot the end result, i.e. the clusters, as a pair plot:
# k-means
k_means2 <-kmeans(Boston, centers = 2)
# pair plot
pairs(Boston, col = k_means2$cluster)
Here, the different clusters are separated by color.
This week’s topic is dimensionality reduction techniques. So, let us get rid of those dimensions.
In this exercise, we will study a data set originally created by the United Nations Development Programme (see http://hdr.undp.org/en/content/human-development-index-hdi). However, we have earlier reworked the set to correspond to our needs (see the file create_human.R for details).
To be able to study the data set, we need to load it:
human <- read.table("data/human_updated.txt", sep = "\t", header = T)
Then, we can examine its summary:
summary(human)
## sec_education_ratio labour_ratio exp_education life_exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni maternal_mortality birth_rate parliament_perc
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
and make a graphical overview:
ggpairs(human)
All distributions seem to be more or less unimodal. Nevertheless, the distributions are not bell-shaped but typically skewed, especially in the cases of 'gni' and 'maternal_mortality', where the skewness is notable. Furthermore, there are large correlations (both negative and positive) between the variables. For example, the birth rate ('birth_rate') seems to be heavily correlated with almost all other variables except 'parliament_perc' and 'labour_ratio'. Other strong correlations can also be found, such as between life expectancy ('life_exp') and maternal mortality ('maternal_mortality').
Let us then perform principal component analysis (PCA) on the non-standardized data set:
pca_human <- prcomp(human)
Then, we can examine its summary:
pca_summary1 <- summary(pca_human)
pca_summary1
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
# rounded variances
pca_pr1 <- round(100*pca_summary1$importance[2, ], digits = 1)
# axis labels
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
# biplot of the first two principal components
# NB the arrows represent the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab1[1], ylab = pc_lab1[2])
The PCA biplot of the non-standardized data set. Clearly, the first principal component (PC1) follows the gross national income per capita ('gni') variable.
Based on the above table, one can easily see that the first principal component (PC1) is dominant and explains about 99.99 % of the variance. By looking at the biplot, one can also see that PC1 is aligned with the 'gni' variable. This is natural because the numerical values of 'gni' are much larger than the values of the other variables. Therefore, we can state that the data needs to be standardized before performing PCA.
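This scale effect is easy to reproduce with made-up data: one variable with a 'gni'-like scale captures essentially all of the variance in an unscaled PCA.

```r
set.seed(42)
toy <- data.frame(small = rnorm(100),             # unit-scale variable
                  big   = rnorm(100, sd = 1000))  # 'gni'-like scale
pc1_share <- function(p) summary(p)$importance["Proportion of Variance", "PC1"]
pc1_share(prcomp(toy))         # close to 1: PC1 is essentially 'big'
pc1_share(prcomp(scale(toy)))  # close to 0.5: both variables contribute
```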
Now, we can standardize the data and then do the same analyses:
# scaling
human_standar <- scale(human)
# pca
pca_human2 <- prcomp(human_standar)
# summary
pca_summary2 <- summary(pca_human2)
pca_summary2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# rounded variances
pca_pr2 <- round(100*pca_summary2$importance[2, ], digits = 1)
# axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
# biplot of the first two principal components
# NB the arrows represent the original variables
biplot(pca_human2, choices = 1:2, cex = c(0.5, 0.7), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])
The PCA biplot of the standardized data set. Now, the first principal component (PC1) does not follow the gross national income per capita ('gni') variable but is a combination of multiple parameters.
As one can see from the above table and biplot, the PCA of the standardized data set is much more realistic. For instance, PC1 is no longer explained by just one single parameter but is set by six different variables. This is due to the scaling, which no longer overemphasizes the GNI parameter.
Based on the PCA plot of the standardized data, PC1 seems to contain variables related to the wealth of the country. On the one hand, there are maternal mortality and the birth rate, which indicate that a country is economically underdeveloped. On the other hand, the other variables, i.e. education, life expectancy and gross national income per capita, give an opposing indication.
The two major contributors to the second principal component (PC2) are the percentages of females in parliament and in the labour force. Therefore, this principal component seems to be related to gender equality.
I could not study the requested data set because there is some kind of conflict when I try to install the required package (mostly with the 'libcurl4-openssl-dev' package). Very strange… Well, I do not have time to solve this issue…